Conversation

jpeeler commented Dec 11, 2019

Description of the change:
This adds an additional function to the syncCatalogSource sync chain to ensure the gRPC readiness probes report ready before attempting any gRPC operations.

Motivation for the change:
The idea is to reduce transient errors as reported on the catalog source status. Unfortunately, I've only been able to reliably improve catalog sources of type grpc, not the ones backed by configmaps (which is what the reported bug is specifically about).

There's also some work done to make debugging easier in the future.
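
For a rough picture of what that sync-chain step looks like, here is a minimal sketch assembled from the snippets quoted in the review below; the lister accessor is an assumption, while reconciler.CatalogSourceLabelKey and the ContainersReady check appear verbatim in the diff:

// Minimal sketch, not the exact PR code: gate the remainder of the
// catalog source sync on the backing registry pod being ready.
func (o *Operator) checkBackingPodStatus(logger *logrus.Entry, in *v1alpha1.CatalogSource) (out *v1alpha1.CatalogSource, continueSync bool, syncError error) {
	out = in.DeepCopy()

	// Find the registry pod(s) labeled for this catalog source.
	selector := labels.SelectorFromSet(map[string]string{reconciler.CatalogSourceLabelKey: in.GetName()})
	pods, err := o.lister.CoreV1().PodLister().Pods(in.GetNamespace()).List(selector) // assumed lister accessor
	if err != nil {
		syncError = err
		return
	}

	if len(pods) > 0 {
		for _, cond := range pods[0].Status.Conditions {
			if cond.Type == corev1.ContainersReady && cond.Status == corev1.ConditionTrue {
				continueSync = true
				return
			}
		}
	}

	// Not ready yet: stop the sync chain here and rely on a requeue.
	logger.Info("backing pod not yet ready")
	return
}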

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci-robot (Collaborator)

@jpeeler: This pull request references Bugzilla bug 1768819, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

bug 1768819: reduce GRPC transient failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the bugzilla/valid-bug label (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) on Dec 11, 2019

@openshift-ci-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpeeler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved (PR has been approved by an approver from all required OWNERS files) and size/XL (changes 500-999 lines, ignoring generated files) labels on Dec 11, 2019

@awgreene (Member)

/retest


jpeeler commented Dec 11, 2019

/retest


jpeeler commented Dec 11, 2019

/retest

openshift-ci-robot added the size/XXL label and removed size/XL on Dec 11, 2019
openshift-ci-robot added the size/XL label and removed size/XXL on Dec 11, 2019
}

if len(pods) > 0 {
	for _, cond := range pods[0].Status.Conditions {

Contributor:

Here we are getting an array of pods but only checking the condition of the pod at index 0, so the first one. Can we always expect to get one pod in the pods list, or if the first one is ready, must the others be as well? If we only look at the one, can we use a Get instead of a List when making the request on L635?

jpeeler (Author):

Hrm, good point. I'm pretty sure the mapping is always 1:1...

jpeeler (Author):

Turns out that Get can only be queried by name, so I'm sticking with List here.
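
To illustrate the trade-off in client-go terms, here is a hypothetical helper; the label key is the one used in the diff, and the pre-context List signature matches client-go of this era:

// Get requires the pod's exact name, which isn't known here; the backing
// pod is only discoverable via the catalog source label, hence List.
func backingPods(client kubernetes.Interface, namespace, sourceName string) (*corev1.PodList, error) {
	selector := labels.SelectorFromSet(map[string]string{
		reconciler.CatalogSourceLabelKey: sourceName,
	})
	return client.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: selector.String()})
}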

return
}

logger.Infof("backing pod not yet ready")

Contributor:

maybe we could add "pod status %s", pods[0].Status.Message to this log message?

jpeeler (Author):

Maybe a good compromise is to indicate the pod name here? The message seems to be blank in my testing, and printing the whole status block would be too much.
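
The compromise described could be as small as this sketch:

// Sketch: name the pod without dumping its whole status block.
logger.WithField("pod", pods[0].GetName()).Info("backing pod not yet ready")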


exdx commented Dec 11, 2019

Interesting how this only works for image-based catalog sources but not those based off config maps. The only real difference for the configmap-based ones is that we have to seed the DB manually from the configmap before serving gRPC connections. I wonder if the gRPC health probe is trying to query the pod before that action is complete, and after a series of failures it starts to back off and act funny. I know we were looking at something like this in the BZ. Maybe we can test the configmap case with a delay in the readiness probe (until after the gRPC connection is ready) and see if that improves behavior?
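
A sketch of that experiment, expressed as the registry pod's readiness probe; the grpc_health_probe exec command and the delay value are assumptions to tune, not the PR's code:

// Hypothetical tweak for configmap-backed sources: delay the first
// readiness check until the registry DB has likely been seeded.
pod.Spec.Containers[0].ReadinessProbe = &corev1.Probe{
	Handler: corev1.Handler{ // corev1.Handler, per the k8s API of this era
		Exec: &corev1.ExecAction{
			Command: []string{"grpc_health_probe", "-addr=:50051"}, // assumed probe command
		},
	},
	InitialDelaySeconds: 15, // assumed value; tune to DB seeding time
	TimeoutSeconds:      5,
}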

@njhale (Member) left a comment:

Thanks for doing this @jpeeler! Looking nice.

For now, I just have some thoughts on implementation:

return
}

func (o *Operator) waitForBackingPod(logger *logrus.Entry, in *v1alpha1.CatalogSource) (out *v1alpha1.CatalogSource, continueSync bool, syncError error) {

Member:

The name makes me think that this method is blocking, when it's not.

jpeeler (Author):

I changed it to checkBackingPodStatus.

chain := []CatalogSourceSyncFunc{
o.syncConfigMap,
o.syncRegistryServer,
o.waitForBackingPod,

Member:

Won't this also be executed for "address" type CatalogSources (i.e. no pods to check readiness for)?

jpeeler (Author):

I honestly totally forgot about catalog sources of that type. I ended up adding a simple early return for that scenario.
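
That early return might look roughly like this; the exact condition is an assumption, based on address-type sources pointing Spec.Address at an existing gRPC endpoint with no OLM-managed pod:

// Sketch: nothing to check for address-type sources.
if in.Spec.SourceType == v1alpha1.SourceTypeGrpc && in.Spec.Image == "" && in.Spec.Address != "" {
	continueSync = true
	return
}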

return
}

func (o *Operator) waitForBackingPod(logger *logrus.Entry, in *v1alpha1.CatalogSource) (out *v1alpha1.CatalogSource, continueSync bool, syncError error) {

Member:

Can this be a member of the registry reconciler?

jpeeler (Author):

It could be, but I'm not sure it should be. Given how the code is structured, with higher-level operations (which I consider pod checking to be) living in the main operator code, it seems best to me where it is. There is a comment to add memoization for ReconcileForSource which, if done later on, would make the requested refactoring a bit less invasive.

Comment on lines 643 to 677
for _, cond := range pods[0].Status.Conditions {
	if cond.Type == corev1.ContainersReady && cond.Status == corev1.ConditionTrue {
		continueSync = true
		return
	}
}

Member:

Can this be a member of the pod decorator used by the registry reconciler?

jpeeler (Author):

To me, this might be more appropriate if the shared code (in this case pod status checking) were not shared by the majority of catalog source types (configmap, grpc). The recently added if statement to ignore grpc address types isn't too ugly; do you agree?


crc := newCRClient(t)
selector := labels.SelectorFromSet(map[string]string{reconciler.CatalogSourceLabelKey: sourceName})
catSrcWatcher, err := crc.OperatorsV1alpha1().CatalogSources(testNamespace).Watch(metav1.ListOptions{LabelSelector: selector.String()})

Member:

Nice! You can also use a field selector on the name instead of using a label.

jpeeler (Author):

Since awaitPods also uses a label selector, I'll stick with that for consistency.
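
For reference, the field-selector variant suggested above would look roughly like:

// Watch a single CatalogSource by name via a field selector instead of a label.
fieldSelector := fields.OneTermEqualSelector("metadata.name", sourceName).String()
catSrcWatcher, err := crc.OperatorsV1alpha1().CatalogSources(testNamespace).Watch(metav1.ListOptions{FieldSelector: fieldSelector})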

Comment on lines +807 to +1025
for {
	select {
	case <-ctx.Done():
		return
	case evt, ok := <-catSrcWatcher.ResultChan():
		if !ok {
			errExit <- errors.New("watch channel closed unexpectedly")
			return
		}
		if evt.Type == watch.Modified {
			catSrc, ok := evt.Object.(*v1alpha1.CatalogSource)
			if !ok {
				errExit <- errors.New("watch returned unexpected object type")
				return
			}
			t.Logf("connectionState=%v", catSrc.Status.GRPCConnectionState)
			if catSrc.Status.GRPCConnectionState != nil {
				if catSrc.Status.GRPCConnectionState.LastObservedState == connectivity.Ready.String() {
					done <- struct{}{}
				}
				require.NotEqual(t, connectivity.TransientFailure.String(), catSrc.Status.GRPCConnectionState.LastObservedState)
			}
		}
	}
}

Member:

Since the watch was started earlier, I think this can be merged with the for {...} at the bottom of this test.

jpeeler (Author):

I don't want to miss any state changes, so I really want to start monitoring the watch before the catalog source is created. Since the result chan is unbuffered, I think that's the only safe way to do it, right?
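
The ordering being defended, sketched; consumeEvents is a hypothetical stand-in for the goroutine running the loop above, and catSrc is the CatalogSource under test:

// Open the watch and start draining its unbuffered result channel
// before creating the CatalogSource, so no state transitions are missed.
catSrcWatcher, err := crc.OperatorsV1alpha1().CatalogSources(testNamespace).Watch(metav1.ListOptions{LabelSelector: selector.String()})
require.NoError(t, err)
defer catSrcWatcher.Stop()

go consumeEvents(ctx, catSrcWatcher.ResultChan()) // hypothetical consumer, started first

_, err = crc.OperatorsV1alpha1().CatalogSources(testNamespace).Create(catSrc) // then create
require.NoError(t, err)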

}
op.sources = grpc.NewSourceStore(logger, 10*time.Second, 10*time.Minute, op.syncSourceState)
-	op.reconciler = reconciler.NewRegistryReconcilerFactory(lister, opClient, configmapRegistryImage, op.now)
+	op.reconciler = reconciler.NewRegistryReconcilerFactory(lister, opClient, configmapRegistryImage, op.now, op.logger.IsLevelEnabled(logrus.DebugLevel))

Member:

I would lean towards just passing in a logrus.Entry instance instead of a debug boolean

jpeeler (Author):

Is the reason so that you can have access to the operator logger in places where that isn't currently possible?

jpeeler (Author):

I've done so now - at least to the level of generating the pod spec.
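
Roughly the shape of that change, with the parameter position assumed:

// Pass the logger itself so the level check (and any future logging)
// happens where the registry pod spec is generated.
op.reconciler = reconciler.NewRegistryReconcilerFactory(lister, opClient, configmapRegistryImage, op.now, op.logger)

// ...and inside the reconciler, a sketch of the debug decision:
if logger.IsLevelEnabled(logrus.DebugLevel) {
	// append extra gRPC logging configuration to the pod spec
}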

openshift-ci-robot added the needs-rebase and size/XXL labels and removed size/XL on Dec 12, 2019
openshift-ci-robot added the size/L label and removed needs-rebase and size/XXL on Dec 12, 2019

jpeeler commented Dec 13, 2019

/retest

}
return
}
if err := o.catsrcQueueSet.Requeue(state.Key.Namespace, state.Key.Name); err != nil {

@ecordell (Member), Dec 13, 2019:

isn't this just requeuing after every state change? is that desired?

jpeeler (Author):

I actually do that so the status stays in sync with what the grpc connection is reporting. This only adds additional requeues when the connection is reported as ready, as requeues were already being done otherwise. I could refactor, since a lot of the above is similar, if you agree.

jpeeler (Author):

I went ahead and refactored it.
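
A hedged sketch of how that refactor could collapse the per-state duplication into one trailing requeue; the type and field names follow the snippets in this thread, and the signature is an assumption:

// Sketch: update status per observed connectivity state, then requeue
// once so the CatalogSource status keeps tracking the gRPC connection.
func (o *Operator) syncSourceState(state grpc.SourceState) { // assumed signature
	switch state.State {
	case connectivity.Ready:
		// record a healthy connection on the status
	default:
		// record the observed (possibly transient) state
	}
	if err := o.catsrcQueueSet.Requeue(state.Key.Namespace, state.Key.Name); err != nil {
		o.logger.WithError(err).Warn("failed to requeue catalog source")
	}
}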


out = in.DeepCopy()

selector := map[string]string{reconciler.CatalogSourceLabelKey: in.GetName()}

Member:

shouldn't this use the pod decorator to get this selector?

jpeeler (Author):

I didn't see a way to have access to the decorator at this level, so I refactored to put some commonality into the reconciler package and use that instead.

jpeeler force-pushed the grpc-transient branch 2 times, most recently from aa7152e to b4e2c24, on December 13, 2019 20:26
openshift-ci-robot added the size/XL label and removed size/L on Dec 13, 2019
openshift-ci-robot added the size/L label and removed size/XL on Dec 13, 2019

jpeeler commented Dec 16, 2019

/test unit

Jeff Peeler added 4 commits January 24, 2020 14:22
This is an attempt to eradicate all the transient failures that GRPC
connections occasionally report. These "failures" usually resolve after
about 20 seconds or so, but seeing the word failure is really unsettling
to users.

The fix here is to confirm that a catalog source pod readiness check has
been completed before attempting to set up a GRPC connection.

The flags weren't being parsed correctly and were causing crashes, so just
remove this unnecessary code.

Also adds some labeling on some test pods, but the new code specifically
ignores them because they are grpc address pods.

Allow the launched catalog pods to be aware of whether or not the log
level is set to debug and add additional GRPC logging if so. In the
future, other debugging can be added without the somewhat large amount
of changes seen here.

jpeeler commented Jan 25, 2020

/retest


jpeeler commented Jan 26, 2020

/retest


jpeeler commented Jan 31, 2020

/retest
If this doesn't pass in a week or so, I'll probably just close it. But if it does pass, I think it should be merged!


jpeeler commented Feb 1, 2020

/retest

ecordell changed the title from "bug 1768819: reduce GRPC transient failures" to "reduce GRPC transient failures" on Feb 4, 2020

@openshift-ci-robot (Collaborator)

@jpeeler: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

reduce GRPC transient failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot removed the bugzilla/valid-bug label on Feb 4, 2020
jpeeler closed this on Feb 29, 2020
exdx mentioned this pull request on Oct 14, 2021